Speaker-trained, isolated word recognizers have achieved notable
success in a wide variety of applications. The training for such
systems generally involves a single replication (or sometimes two
replications) of each word of the vocabulary by the designated talker.
Word reference templates are then formed directly from these
replications. In recent work on speaker-independent word recognition,
it has been shown that statistical clustering procedures provide an
effective way of determining the structure in multiple replications of
a word by different talkers. Such techniques were then used to provide
a set of reference templates based on the clustering results. In this
paper we discuss the application of clustering techniques to
speaker-trained word recognizers. It is shown that significant
improvements in recognition accuracy are obtained when using templates
obtained from a clustering analysis of multiple replications of a word
by the designated talker. It is also shown that recognition accuracy
did not change with time (over a 6-month period) for any of the
subjects tested, thereby indicating that the reference templates were
reasonably stable.

I. INTRODUCTION 

Although a great deal has been learned about isolated word speech
recognition systems, 1-14 several key issues are not as well understood
as others. One such issue is the manner in which the word reference
templates for such a system are obtained. To date, there have been at
least three distinct ways of obtaining templates, including:

(i) Casual training, in which the designated talker (for a
speaker-trained system) speaks each word of the vocabulary (one or more
times) and a reference template is created for each spoken word. 3,4
Thus, for casual training, there is a direct correspondence between a
spoken token of the word and the reference template.

(ii) Averaging methods, in which the designated talker (for a
speaker-trained system) or a set of talkers (for a speaker-independent
system) speaks the word a number of times and a weighted,
time-normalized average of the feature sets for that word is used as the
reference template. 1,7,15

(iii) Statistical clustering methods, in which a set of talkers speak
the word and a statistical pattern recognition algorithm is used to
group the feature sets of the tokens into a set of clusters. 14,16 The
similarity of tokens within a cluster is high (small intratoken distances),
whereas the similarity of tokens in different clusters is low (large
intertoken distances). Reference templates are obtained by representing
each cluster by a single template (either using a minimax
approach, 14 or via averaging techniques 17). Thus, a word is generally
represented by a set of templates rather than one or two templates.

The third method above, the statistical approach, has been
successfully applied to a speaker-independent word recognizer for a variety of
vocabularies. 14,17,18 It is the purpose of this paper to show how this
technique can be applied to speaker-trained systems to further
increase their accuracy and robustness over systems in which the
reference templates are obtained by casual training.

The organization of this paper is as follows. In Section II we review 
the operation of the basic word recognizer and the clustering proce- 
dures. In Section III we present the experimental procedures used to 
obtain the data for training and testing the system. The statistics of 
the clustering for each of three talkers are presented in Section IV, 
and the recognition accuracy as a function of key system parameters 
is given in Section V. Finally, Section VI discusses the results and their 
implications for practical implementations of word recognition sys- 
tems. 



II. REVIEW OF THE WORD RECOGNITION SYSTEM 

The word recognizer, shown in Fig. 1, is similar to the one originally
proposed by Itakura, 3 and has been used in a variety of
applications. 4,13,14,16-18 Telephone line input signals (100- to 3200-Hz
bandwidth) are digitized at a 6.67-kHz rate, and an 8th-order (p = 8)
autocorrelation analysis is performed on overlapping frames of N =
300 samples (45 ms), with an overlap of 200 samples between frames.
Prior to the autocorrelation analysis, each frame of data is
preemphasized with a first-order digital network with transfer function
$(1 - 0.96 z^{-1})$ and windowed by a 300-sample Hamming window.
If we denote the lth preemphasized, windowed frame of speech as
$x_l(n)$, $0 \le n \le N - 1$, then

$$x_l(n) = x(lS + n)\, w(n), \qquad 0 \le n \le N - 1, \quad 0 \le l \le L - 1, \tag{1}$$

2218 THE BELL SYSTEM TECHNICAL JOURNAL, DECEMBER 1979 






APPLICATION OF CLUSTERING TECHNIQUES 2219 



where x(n) is the preemphasized speech, w(n) is a Hamming window,
S = 100 samples (15 ms) is the shift in samples between adjacent
frames, and L is the number of frames in the recording interval. The
autocorrelation coefficients of the lth frame, $R_l(m)$, are given by

$$R_l(m) = \sum_{n=0}^{N-1-m} x_l(n)\, x_l(n + m), \qquad 0 \le m \le p \tag{2}$$

$$\phantom{R_l(m)} = \sum_{n=0}^{N-1-m} x(lS + n)\, w(n)\, x(lS + n + m)\, w(n + m). \tag{3}$$

The zeroth autocorrelation coefficient of each frame, $R_l(0)$, is the
energy in the frame. The time pattern of $R_l(0)$ (i.e., $R_l(0)$ vs l) is used
to determine the end-point boundaries of the spoken, isolated word in
a simple manner based on the measured energy of the background
noise in the recording environment. 14
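The front-end analysis of eqs. (1) to (3) can be illustrated with a short sketch. It is a minimal pure-Python illustration, not the original implementation: the function names are hypothetical, the Hamming window uses the standard 0.54/0.46 form, and the preemphasis constant defaults to the value printed in the text.

```python
import math

def preemphasize(samples, alpha=0.96):
    """Apply the first-order network y(n) = s(n) - alpha * s(n - 1)."""
    out = []
    prev = 0.0
    for s in samples:
        out.append(s - alpha * prev)
        prev = s
    return out

def hamming(length):
    """Standard Hamming window of the given length (assumed form)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def windowed_frames(x, frame_len=300, shift=100):
    """Cut x into overlapping frames x_l(n) = x(l*S + n) * w(n), eq. (1)."""
    w = hamming(frame_len)
    n_frames = max(0, (len(x) - frame_len) // shift + 1)
    return [[x[l * shift + n] * w[n] for n in range(frame_len)]
            for l in range(n_frames)]

def autocorrelation(frame, order=8):
    """Return R_l(m) for m = 0..order; R_l(0) is the frame energy."""
    n_len = len(frame)
    return [sum(frame[n] * frame[n + m] for n in range(n_len - m))
            for m in range(order + 1)]
```

For a 500-sample signal, `windowed_frames` yields three frames of 300 samples each, and `autocorrelation` returns p + 1 = 9 coefficients per frame, the first of which is the frame energy used for end-point detection.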

As noted in Fig. 1, there are three modes of operation of the word 
recognizer, namely, training, clustering, and testing (normal usage). As 
discussed earlier, for many speaker-trained systems, the training mode 
is simply a recording of each word of the vocabulary and the clustering 
(i.e., formation of word reference templates) is a direct conversion 
from stored autocorrelation coefficients to the format required for the 
word reference template. (For this recognizer, the word reference
templates are stored as frames of autocorrelated linear predictive
coding (LPC) coefficients. This is explained later in this section.) For a
statistical clustering approach, however, training consists of storing 
sets of autocorrelation coefficients for a number of replications of each 
word of the vocabulary, and clustering consists of grouping the repli- 
cations of each word into clusters and creating a single word reference 
template for each cluster. 

For the third mode of the system, namely the testing or normal
usage mode, following end-point detection, an LPC analysis of each
frame is performed (the autocorrelation method), and each
autocorrelation frame is converted to a normalized form as follows. If we
denote the frames of autocorrelations in the test word as $R_i(m)$, i =
1, 2, ..., NT, and the LPC prediction residual of each frame as $E_i$, then
the test frame parameters are given as

$$V_i(m) = \frac{R_i(m)}{E_i}, \qquad 0 \le m \le p, \quad 1 \le i \le NT. \tag{4}$$

For the word reference templates, if we denote the jth frame of
LPC coefficients (derived from the autocorrelation coefficients) as
$a_j(k)$, $0 \le k \le p$, then the jth reference frame parameter set is given as

$$P_j(m) = 2 \sum_{k=0}^{p-m} a_j(k)\, a_j(k + m), \qquad 1 \le m \le p \tag{5a}$$

$$P_j(m) = \sum_{k=0}^{p} [a_j(k)]^2, \qquad m = 0. \tag{5b}$$

A distance can now be defined between the ith test frame and the jth
reference frame as

$$d(i, j) = d(V_i, P_j) = \log \left[ \sum_{m=0}^{p} V_i(m)\, P_j(m) \right]. \tag{6}$$



The distance measure of eq. (6) has been shown to be an effective
measure for comparing sets of LPC coefficients in a variety of
applications, 3,19-22 and it can be computed with (p + 1) multiplications and
additions and one logarithm.

The next step in the recognizer is to compare the test word against
each stored word reference template. A dynamic time-warping
algorithm is used to optimally align in time the test and reference patterns
and to give the average distance associated with the optimal warping
path. The average distance for the qth template of the rth reference
word is

$$D_{r,q} = \min_{w_{r,q}(i)} \left[ \frac{1}{NT} \sum_{i=1}^{NT} d\bigl(i, w_{r,q}(i)\bigr) \right], \tag{7}$$

where $w_{r,q}(i)$ is the optimally determined warping path. The final step
in the process is to choose the recognized word based on the set of
average distances $D_{r,q}$. The most common decision rule is the minimum
distance rule, which chooses the word r* such that

$$D_{r^*,q^*} \le D_{r,q} \qquad \text{all } r, q \tag{8}$$

for some value q*. An alternative and more powerful decision rule (for
the case of multiple reference templates) is the K-nearest neighbor
rule (KNN), which says that for each word r, the distances $D_{r,q}$ are
reordered according to average distance so that

$$D_{r,[1]} \le D_{r,[2]} \le \cdots \le D_{r,[Q]}, \tag{9}$$

where Q is the number of templates for the rth word, and the KNN rule
says to choose the word r* such that

$$\sum_{k=1}^{K} D_{r^*,[k]} \le \sum_{k=1}^{K} D_{r,[k]} \qquad \text{all } r. \tag{10}$$

For K = 1 the KNN rule is the minimum distance rule. Unless otherwise
noted, a value of K = 2 was used in the recognition tests in this paper.
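The time alignment and the two decision rules can be sketched as follows. This is a simplified illustration, not the recognizer's actual warping algorithm: the local path constraints (the reference index advances by 0, 1, or 2 frames per test frame, with fixed end points) are an assumption, the function names are hypothetical, and the KNN score is taken as the mean of the K smallest distances, which ranks words identically to a sum whenever every word has at least K templates.

```python
def dtw_average_distance(test, ref, frame_dist):
    """Average distance along the best warping path w(i), in the spirit of eq. (7).

    Assumed constraints: w(1) = 1, w(NT) = NR, and w(i) - w(i-1) in {0, 1, 2}.
    """
    nt, nr = len(test), len(ref)
    inf = float("inf")
    prev = [inf] * nr
    prev[0] = frame_dist(test[0], ref[0])
    for i in range(1, nt):
        cur = [inf] * nr
        for j in range(nr):
            best = min(prev[j - step] for step in (0, 1, 2) if j - step >= 0)
            if best < inf:
                cur[j] = best + frame_dist(test[i], ref[j])
        prev = cur
    return prev[nr - 1] / nt

def knn_decision(distances, k=2):
    """Pick the word whose k smallest template distances have the lowest mean.

    distances maps each word r to its non-empty list of D_{r,q} values.
    """
    best_word, best_score = None, float("inf")
    for word, dists in distances.items():
        nearest = sorted(dists)[:k]
        score = sum(nearest) / len(nearest)
        if score < best_score:
            best_word, best_score = word, score
    return best_word
```

For example, with distances `{"B": [0.2, 0.9, 1.0], "D": [0.3, 0.35, 2.0]}` the minimum distance rule (K = 1) selects B, whereas K = 2 selects D, because D has two close templates while B has only one.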

2.1 The clustering procedure

The clustering analysis is based on the fully automatic technique
(unsupervised with averaging, UWA) described in Ref. 17. It was
assumed that we begin with M replications of each word in the
vocabulary and, based on the pairwise dynamic time-warped average
distance between words, the M tokens are grouped into P disjoint
clusters, $\omega_i$, such that

$$\Omega = \{x_1, x_2, \ldots, x_M\} = \bigcup_{i=1}^{P} \omega_i, \tag{11}$$

where $x_1, x_2, \ldots, x_M$ are the M tokens in the set. The total number of
clusters, P, is determined automatically by the clustering procedure.
Each cluster, $\omega_i$, is represented by a prototype $\hat{x}_i$. Based on the work
of Rabiner and Wilpon, 17 the tokens within cluster $\omega_i$ are averaged
(using dynamic time warping for time alignment) to give the prototype
$\hat{x}_i$. Word reference templates are determined as the prototypes $\hat{x}_i$
corresponding to the P largest clusters, i.e., for a single template we
choose the prototype of the cluster with the most tokens; for a
two-template representation, we choose the prototypes of the two clusters
with the largest number of tokens, etc.

The grouping of the M tokens into P clusters is based on splitting of
the set $\Omega$ by iteratively determining cluster centers (based on a
minimax criterion) and cluster points based on a given distance threshold.
Ultimately, all M tokens are assigned to one of the clusters. A cluster
may consist of a single outlier token whose distance to all other tokens
in $\Omega$ is greater than the distance threshold of the procedure. The final
set of P clusters is ordered based on size of the clusters, and the
averaged centers of the P largest clusters are retained as the P word
reference templates.
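The general idea of threshold-based grouping around minimax centers can be sketched as below. This is a deliberately simplified, single-pass illustration and should not be read as the UWA procedure of Ref. 17: it does not iterate the centers, it does not average tokens into prototypes, and the function name and the example threshold are hypothetical.

```python
def threshold_clusters(dist, threshold):
    """Group token indices into clusters, largest cluster first.

    dist is a symmetric matrix of pairwise (time-warped) token distances.
    """
    unassigned = set(range(len(dist)))
    clusters = []
    while unassigned:
        # Minimax center: the token whose largest distance to the
        # remaining tokens is smallest.
        center = min(sorted(unassigned),
                     key=lambda c: max(dist[c][t] for t in unassigned))
        # The center always joins its own cluster, so the loop terminates.
        members = sorted(t for t in unassigned
                         if t == center or dist[center][t] <= threshold)
        clusters.append(members)
        unassigned -= set(members)
    clusters.sort(key=len, reverse=True)
    return clusters
```

For six tokens placed at 0, 1, 2, 10, 11, and 30 on a line, with absolute difference as the distance and a threshold of 2.5, the sketch returns the clusters [0, 1, 2] and [3, 4] plus the single outlier [5]. Keeping the first Q entries of the returned list corresponds to retaining the Q largest clusters as templates.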

III. EXPERIMENTAL PROCEDURES 

To test the effectiveness of the clustering analysis for a speaker- 
trained system, three talkers trained the recognizer of Fig. 1. One of 
the three talkers was the first author of this paper. The other two 
talkers were experienced workers in the area of speech processing. All 
three talkers were instructed to speak the words naturally, but in an 
isolated format. No specific motivation for good performance was 
employed, as the talkers' interest in the area was considered sufficient. 
The vocabulary for these tests consisted of the letters A to Z, the digits
0 (zero) to 9, and the command words STOP, ERROR, and REPEAT for a
total of 39 words. This vocabulary is an extremely difficult one
(especially when recorded over telephone lines, as was done here), but one
which has utility in a wide range of practical applications. 23

Each talker spoke the 39-word vocabulary (in a random order) three 
times per session over a one-month period for a total of 50 replications 
for each word in the vocabulary. A total of 17 sessions was used, with 
only two recordings in the last session. 
A clustering analysis was performed for each talker, and a set of
word reference templates was obtained. For speaker-independent
systems, a total of Q = 12 reference templates per word was used. For
comparison purposes, the same number of templates was obtained for
the speaker-trained vocabulary. However, results are also given for a
variable number of templates per word.

To test the system, each of the three talkers spoke the 39-word 
vocabulary five times per session for a total of 10 sessions. Each session 
was at least two weeks after the preceding one; thus, a total of at least 
20 weeks was used to obtain the 10 test sets. 

Additional analyses were performed to show the effects of reduced 
training on the recognition accuracy. To do this, we simply used fewer 
training runs in the clustering analysis. As such, results are presented 
for cluster sets based on 24, 12, and 6 replications of the word vocab- 
ulary during the training phase. 



IV. CLUSTER STATISTICS 

Based on the clustering analysis, a set of objective statistics on the 
clusters can be given which indicates how the tokens cluster. In 
accordance with past experience with these clustering algorithms, the 
following statistics appear to be most meaningful: 

(i) Number of clusters per word. A cluster is defined as a set with 
at least two tokens. 

(ii) Number of outliers per word. An outlier is a token that does 
not fall into one of the clusters above, i.e., its distance to all other 
tokens in the training set exceeded a threshold. 

(iii) Quality ratio, a, defined as the ratio of the average intercluster
distance (as defined between cluster prototypes) to the average
intracluster distance (as defined between cluster tokens).

(iv) Size of largest cluster, i.e., the number of tokens in the largest
set.

This set of cluster statistics gives an excellent picture of how the M 
tokens are distributed in the feature space of the problem being 
studied. 
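The four statistics listed above can be computed from a pairwise distance matrix and a cluster assignment as in the sketch below. Several choices here are assumptions made only for illustration: each prototype is represented by a single token index (the prototypes in this work are averaged tokens), intracluster distances are pooled over all clusters before averaging, and outliers are excluded from the intercluster average. The function name is hypothetical.

```python
def cluster_statistics(dist, clusters, prototypes):
    """Summarize a clustering of one word's tokens.

    dist: pairwise token distance matrix.
    clusters: lists of token indices (single-token lists are outliers).
    prototypes: one representative token index per cluster (assumption).
    """
    real = [(c, p) for c, p in zip(clusters, prototypes) if len(c) >= 2]
    outliers = sum(1 for c in clusters if len(c) == 1)
    # All token pairs that share a cluster.
    intra = [dist[a][b] for c, _ in real
             for i, a in enumerate(c) for b in c[i + 1:]]
    # All pairs of prototypes of distinct clusters.
    inter = [dist[p][q] for i, (_, p) in enumerate(real)
             for _, q in real[i + 1:]]
    ratio = None
    if intra and inter and sum(intra) > 0:
        ratio = (sum(inter) / len(inter)) / (sum(intra) / len(intra))
    return {
        "clusters": len(real),
        "outliers": outliers,
        "largest": max((len(c) for c in clusters), default=0),
        "quality_ratio": ratio,
    }
```

A quality ratio well above 1 means that prototypes of different clusters lie much farther apart than tokens within the same cluster.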

Table I gives the statistics of the word clusters for the three talkers 
used in this investigation. Included in the table are averaged, minimum, 
and maximum values of the cluster statistics for each of the three 
talkers. The statistics in Table I were obtained from clustering the 50 
replications of each word for each talker. It is seen that the average 
values of all statistics are about the same for all three talkers. Typically, 
about 6 clusters per word were sufficient to include all nonoutlier 
tokens. Included in the six clusters were, on the average, 38 of the 50 
tokens, with about 20 of the 50 tokens in the biggest cluster. The
quality ratios of between 2.6 and 2.9 indicate good cluster separation
for each of the talkers. 16

Table I — Statistics of the word clusters for the three talkers

Subjects: LRR, AER, SWC (Avg, Min, and Max given for each)

Number of clusters per word: … 10 … 10 … 10
Number of outliers per word: … 17 11 …
Quality ratio (a): … 1.99 4.06 2.92 2.52 3.86 … 2.12 3.41
Size of largest cluster: … 46 20 7 35 21 … 31

Fig. 2 — Recognition error as a function of word position for the three talkers for
(a) clustered templates and (b) randomly chosen templates.

V. RECOGNITION RESULTS 

Recognition results on the total of 1950 words (50 replications of the
39-word vocabulary) for each of the three talkers (LRR and AER were male,
SWC was female) are presented in Figs. 2 and 3. Figure 2a shows a
series of plots of the percentage errors as a function of word position
for the three talkers for reference templates obtained from the
clustering analysis. Word error rate for the kth word position is the percentage
of words which were not within the top k candidates. A total of 12
templates per word was used in these tests. Overall error rates of 1.4
percent (SWC), 1.6 percent (LRR), and 4.4 percent (AER) were obtained
for the three talkers for the top recognition candidate (i.e., the
first-choice candidate). The error rate was below 1 percent for the top two
candidates (word position 2) for all three talkers.

Fig. 3 — Recognition error as a function of the number of templates per word (top
choice candidate) for the three talkers for (a) clustered templates and (b) randomly
chosen templates.
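The word-position error rate plotted in Fig. 2 can be computed as in the following minimal sketch. The function name and the data layout (a list of pairs holding the spoken word and the recognizer's ranked candidate list) are assumptions made for illustration.

```python
def word_position_error(results, k):
    """Percentage of test words whose true label is not in the top k.

    results: list of (true_word, ranked_candidates) pairs, best first.
    """
    misses = sum(1 for truth, ranked in results if truth not in ranked[:k])
    return 100.0 * misses / len(results)
```

By construction the error rate cannot increase with k, which is why the curves of Fig. 2 fall as the word position grows.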

For comparison purposes, a set of word templates was created by
randomly choosing tokens from the 50 replications of each word and
creating one reference template directly from each token. Again, a
total of 12 templates per word was used. Figure 2b shows the error
scores for the random set of word templates for the three talkers.
Overall error rates of 3 percent (SWC), 2 percent (LRR), and 5.6 percent
(AER) were obtained for the top recognition candidate. Although these
error rates were somewhat higher than the scores obtained from the
clustered template set, the differences are reasonably small and
indicate that the clustering analysis is unnecessary if we are using 12
templates per word. In such a case, a random selection of word
templates is essentially equivalent.

It is shown in Fig. 3, however, that the results given in Fig. 2 are not
a complete picture of the effectiveness of the clustering analysis.
Figure 3a shows plots of percentage error for the top recognition candidate
as a function of the number of templates per word for the clustered
template set, and Fig. 3b shows similar plots for randomly chosen
templates. For talkers LRR and SWC, it is readily seen that the error
rate does not change for more than four templates per word for the
clustered data. For talker AER, the error rate decreases by about 0.6
percent as the number of templates per word increases from 6 to 12.
Thus, Fig. 3a indicates that from 4 to 6 templates per word obtained
via clustering give comparable recognition accuracies to 12 templates
per word obtained in the same manner.

A totally different picture emerges from Fig. 3b for the case of 
randomly chosen templates. (Note that the vertical scale of Fig. 3b is 
different from the vertical scale of Fig. 3a.) It is seen that the percent- 
age error decreases steadily as the number of (random) templates per 
word increases until about 10 templates per word. Thus for randomly 
chosen templates a substantially larger number of templates per word 
are required than for templates obtained from a clustering analysis. 
An alternative way of stating this is that recognition accuracies from 
3 to 4 templates per word (obtained from the clustering analysis) are 
comparable to those obtained from 10 to 12 randomly chosen templates 
per word. 

5.1 Confusions among the words

The spoken letters of the alphabet form one of the most difficult of
word recognition vocabularies because of the high confusability among
sets of the letters. 14,23 A major advantage of the clustering analysis is
that confusions among many of the subsets are entirely eliminated.
The major confusions, for all three talkers, were in the equivalence
class of letters containing B, C, D, E, G, P, T, V, and Z. The confusion
matrices for this class for the three talkers (for 12 templates per word)
are shown in Table II. For talker SWC, 26 of the 27 errors were within
this equivalence set; for talker LRR all 31 errors occurred within the
equivalence set; for talker AER, 72 of the 85 recognition errors occurred
within the equivalence class; however, one confusion was with a word
outside the set. (Nine of the remaining errors were A, K confusions.)
Table II shows that each talker had one or more letters in the major
equivalence class which were hard to reliably recognize; however, for
all three talkers most letters in the hard equivalence class were
correctly recognized. This result again demonstrates the power of the
clustering analysis in determining the structure of each word in the
vocabulary.
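The error counts quoted above can be related to the overall error rates with a short calculation, taking the 1950 test words per talker (50 replications of 39 words) as the base. The sketch below is only an arithmetic check; the variable and function names are hypothetical.

```python
TOTAL_WORDS = 50 * 39  # test replications times vocabulary size

# Talker: (total errors, errors inside the B, C, D, E, G, P, T, V, Z class)
ERRORS = {"SWC": (27, 26), "LRR": (31, 31), "AER": (85, 72)}

def error_summary(total_errors, in_class_errors, total_words=TOTAL_WORDS):
    """Return (overall error rate, share of errors inside the class), in percent."""
    rate = 100.0 * total_errors / total_words
    share = 100.0 * in_class_errors / total_errors
    return round(rate, 1), round(share, 1)

SUMMARY = {talker: error_summary(*counts) for talker, counts in ERRORS.items()}
```

The resulting error rates, 1.4, 1.6, and 4.4 percent for SWC, LRR, and AER, agree with the top-candidate error rates reported for the clustered templates, and between about 85 and 100 percent of each talker's errors fall inside the equivalence class.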

5.2 Recognition accuracy vs time 

An important aspect of a speaker-trained word recognizer is the
stability of the reference templates as a function of time. For casually
trained, nonadaptive systems, the reference templates often degrade
with time and the system must be retrained. 1 Since training is so
simple for these systems, this generally does not pose a problem.
However, some mechanism must be provided for detecting the
degradation of the reference templates and retraining the system.

Table II — Confusion matrices of the equivalence class with
B, C, D, E, G, P, T, V, Z for the three talkers

For a clustering analysis method of obtaining reference templates, 
it is imperative that the templates be robust in time, i.e., that no 
degradation in recognition accuracy occurs, since the training proce- 
dure is a long and involved one. To demonstrate that the reference 
templates from this system are indeed robust, Fig. 4 shows plots of the 
error rate vs time for each of the three talkers. It is seen that over the 
20-week period of testing, only small changes occur in the recognition 
accuracy. 

5.3 The effects of reduced training 

Since the amount of training used to obtain the accuracies reported 
here was quite extensive (50 repetitions of each word), the effects of 
reduced training on the recognition scores are important to understand. 
Thus, the clustering analysis was redone using subsets of the 50 
replication training data. The subsets included the first 24, 12, and 6 
replications. Before discussing the results, two points should be noted. 
Each recording session consisted of three consecutive replications of 
the word list. Thus the three subsets constitute eight, four, and two 
recording sessions. This is important since it was found that a high 
degree of correlation existed between tokens within a given recording 
session. 

The second point of note is that, for the 12 replications training set,
the maximum number of clusters was limited to six (including outliers),
and for the six replication training set the maximum number of clusters
was limited to four. The reason for this limitation is that more than six
(or four) meaningful clusters cannot be obtained from the reduced
number of tokens. For recognition purposes, the KNN rule with K = 1 was used
for the 12 and 6 replication template sets.

Fig. 4 — Recognition error as a function of time for the three talkers.

The recognition results for the reduced training sets are shown in
Figs. 5 and 6. Figure 5 shows plots of percentage error as a function of
word position for each training set and for each talker. Figure 6 shows
plots of percentage error vs the number of templates per word for
these cases. It is seen that, in all cases, the reduced training set leads
to increased error in recognition. In reducing the size of the original
training set (from 50 to 24 replications), the error increased about
1.5 percent, on the average, for the three talkers. In going from 50 to
12 replications for training, the error increased by about 2.5 percent
for the three talkers, and in going from 50 to 6 replications, the error
increased by about 3.3 percent.

Fig. 5 — Recognition error as a function of the word position for different numbers of
training sets for the three talkers for both clustered and random templates.

Fig. 6 — Recognition error as a function of the number of templates per word and the
number of training sets for the three talkers for the clustered templates.

The results given above indicate that increased training always gave 
better templates from the clustering analysis and reduced the recog- 
nition error rate. 






5.4 Comparisons to casually trained systems 

The recognition system of Fig. 1 was casually trained to each of the
three talkers by having them speak the vocabulary twice and creating
reference templates directly from the spoken words. The recognition
tests were then rerun using the casually obtained word templates. The
average error rates (over the 50 replications) for the three talkers were
6.5 percent for talker LRR, 12.9 percent for talker AER, and 12.5 percent
for talker SWC. Since the overall error rates for the clustered data were
1.6, 4.3, and 1.4 percent for these talkers, respectively, reductions in
error rate of 4.9, 8.6, and 11.1 percent were obtained. These error
reductions represent a substantial improvement in the recognition.
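The comparison with casual training can be checked by direct subtraction of the quoted error rates, which gives differences of 4.9, 8.6, and 11.1 percentage points for LRR, AER, and SWC. The sketch below also reports the reduction relative to the casually trained error rate, a figure that is derived here for illustration and is not stated in the text; the names are hypothetical.

```python
CASUAL = {"LRR": 6.5, "AER": 12.9, "SWC": 12.5}      # percent error, casual training
CLUSTERED = {"LRR": 1.6, "AER": 4.3, "SWC": 1.4}     # percent error, clustered templates

def error_reduction(before, after):
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = round(before - after, 1)
    relative = round(100.0 * (before - after) / before, 1)
    return absolute, relative

REDUCTIONS = {t: error_reduction(CASUAL[t], CLUSTERED[t]) for t in CASUAL}
```

In relative terms, clustering removes roughly two thirds to nine tenths of the errors made with casually trained templates for these three talkers.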



VI. DISCUSSION 

The main result of this paper is the demonstration that statistical 
clustering techniques can be applied equally well to speaker-trained 
word recognizers as they have been to speaker-independent ones. It 
was shown that, with sufficient training and through the use of well- 
developed clustering algorithms, extremely high recognition scores can 
be obtained, even with vocabularies as difficult as the letters of the 
alphabet. This result indicates that, if a user is sufficiently motivated 
to spend the time necessary to train a word recognizer, he can reliably 
use the recognizer with a range of vocabularies in a wide variety of 
applications. 1,23 

An important consideration in a practical implementation of a 
system like the one described in this paper is to keep the number of 
reference templates as small as possible. It was shown that about 4 to 
6 templates per word were sufficient for the given vocabulary. It is 
anticipated that for alternative, less complex vocabularies even fewer 
templates per word would be required. The templates themselves 
appear to be stable with time as the recognition scores did not change 
appreciably through the 20 weeks of testing. 

One point in question about this work is that only three (experi- 
enced) talkers were used. We can only speculate on what the results 
would be for a larger set of talkers. It is believed that the clustering 
approach would be highly effective for any talker. (It should be 
especially good for an inexperienced one who has a lot of replication- 
to-replication variability in the way he says the words.) As such, the 
conjecture is that even larger improvements in recognition accuracy 
over casual training would be obtained by using this system for a wide 
range of talkers. 

Finally, it was shown that the clustering analysis could be bypassed
with sufficient training, if a large number of randomly chosen templates
(10 to 12) were used to represent each word in the vocabulary. If
computational complexity were not an issue, this result could be useful
for some applications.

VII. SUMMARY 

We have shown that statistical clustering techniques can be applied
to a speaker-trained, isolated word recognition system to provide
significant improvements in recognition accuracy over casually trained
systems. The amount of training required for such a system is fairly
extensive. Thus, this method would probably be limited to applications
requiring extremely difficult vocabularies (e.g., the letters of the
alphabet), or those in which very high recognition accuracies are required.